TAS: KEP update for API #3237

Merged · 1 commit · Oct 16, 2024

Conversation

@mimowo (Contributor) commented Oct 15, 2024

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

To reflect API design decisions made during implementation:

  • better comments
  • change the TopologyAssignment in the Workload API:
    • use "domain" rather than "group" consistently
    • break nodeLabels for domains into an ordered list of keys (levels) common to all domains, and an ordered list of values per domain; this saves a lot of space for large assignments by avoiding duplication of the keys (see the sketch below)
    • included example here
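
As a rough illustration (the struct layout follows the fields discussed in this PR, but the type names are local stand-ins and the label keys/values are made up), the keys are stored once per assignment and each domain repeats only its values:

```go
// Simplified stand-ins for the API types discussed in this PR.
type TopologyDomainAssignment struct {
	// values are ordered consistently with the levels of the assignment.
	Values []string `json:"values"`
}

type TopologyAssignment struct {
	// levels lists the node label keys once, shared by every domain.
	Levels []string `json:"levels"`
	// domains repeat only the per-domain values, avoiding key duplication.
	Domains []TopologyDomainAssignment `json:"domains"`
}

// Hypothetical assignment spanning two racks within one block.
var example = TopologyAssignment{
	Levels: []string{"example.com/topology-block", "example.com/topology-rack"},
	Domains: []TopologyDomainAssignment{
		{Values: []string{"block-1", "rack-1"}},
		{Values: []string{"block-1", "rack-2"}},
	},
}
```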

Which issue(s) this PR fixes:

Part of #2724

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot added the release-note-none and kind/cleanup labels on Oct 15, 2024
@k8s-ci-robot added the cncf-cla: yes and approved labels on Oct 15, 2024

@mimowo (Contributor, author) commented Oct 15, 2024

/cc @tenzen-y @PBundyra

@k8s-ci-robot added the size/XXL label on Oct 15, 2024
@k8s-ci-robot requested a review from PBundyra on October 15, 2024 14:16
netlify bot commented Oct 15, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 864e36a
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/670fdb9ccc43e400084b493a

@k8s-ci-robot added the size/L label and removed the size/XXL label on Oct 15, 2024

@mimowo (Contributor, author) commented Oct 15, 2024

/assign @PBundyra @tenzen-y

@mimowo (Contributor, author) commented Oct 15, 2024

/cc @mwielgus @alculquicondor

Groups []TopologyAssignmentGroup `json:"groups"`
Domains []TopologyDomainAssignment `json:"domains"`

// levels is an ordered list of keys denoting the levels of the assigned

Contributor

What is the purpose of this field?

Contributor Author

TopologyUngater needs the information about the keys to construct the full node selector when ungating pods.

This information is also present in the Topology object, but I think it is useful to also put it here.

Otherwise we would need to complicate the code of the Ungater to look up the Topology object via the ResourceFlavor, and deal with possible deletions / changes of the API objects in the meantime.

The list has at most 5 elements (levels), so the Workload object size should not be an issue.
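
A minimal sketch (hypothetical helper, not the actual Kueue code) of how the ungater could combine the shared keys with a single domain's values into a node selector:

```go
// buildNodeSelector zips the assignment-level keys (.levels) with one
// domain's values (.domains[i].values); both lists are ordered consistently,
// so index i of values corresponds to index i of levels.
func buildNodeSelector(levels, values []string) map[string]string {
	selector := make(map[string]string, len(levels))
	for i, key := range levels {
		selector[key] = values[i]
	}
	return selector
}
```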

Contributor

Ok, put this field above domains.

Contributor Author

Done

// +kubebuilder:validation:MinItems=1
NodeLabels map[string]string `json:"nodeLabels"`
type TopologyDomainAssignment struct {
// values is an ordered list of node selector values describing a topology

Contributor

It would be clearer if you duplicated this description in TopologyAssignment.

Contributor Author

Sure, I've updated the TopologyAssignment description to include the information on how the node selector is specified. PTAL.

//
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:MaxItems=5

Contributor

Let's be a bit more generous here and bump it to 8, so that it covers all possible use cases and at the same time has a sane max limit.

Contributor Author

sure, done


@PBundyra (Contributor)

/lgtm
/hold so @tenzen-y can take a look

@k8s-ci-robot added the do-not-merge/hold and lgtm labels on Oct 16, 2024

@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: cb7590a3326b99750053131557778fc63df34b2f

@k8s-ci-robot removed the lgtm label on Oct 16, 2024

@mwielgus (Contributor)

/lgtm

@k8s-ci-robot added the lgtm label on Oct 16, 2024

@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: b6669556f015d2be83940c4a9d0a8277537c2614

// +kubebuilder:validation:MinItems=1
NodeLabels map[string]string `json:"nodeLabels"`
// +kubebuilder:validation:MaxItems=8
Values []string `json:"values"`

Member

I think values is a slightly generic name, and it's hard to recognize the objective of this field, since the TopologyDomainAssignment does not have the context of the nodeSelector.

Will we receive anything other than the nodeSelector keys here?

@mimowo (Contributor, author), Oct 16, 2024

> I think values is a slightly generic name, and it's hard to recognize the objective of this field, since the TopologyDomainAssignment does not have the context of the nodeSelector.

I agree, this is why I mention the node selector in the comment. We could name the field nodeSelectorValues, but it seems verbose.

> Will we receive anything other than the nodeSelector keys here?

The node selectors are constructed by combining only the keys (.levels field) and values (.domains.values field).

Member

> Will we receive anything other than the nodeSelector keys here?
>
> The node selectors are constructed by combining only the keys (.levels field) and values (.domains.values field).

Sorry, I meant the nodeSelector values.

> I agree, this is why I mention the node selector in the comment. We could name the field nodeSelectorValues, but it seems verbose.

In that case, how about levelValues, since these values are for the levels?

Contributor Author

> In that case, how about levelValues, since these values are for the levels?

As a name it sgtm, but maybe the only potential downside is gRPC size if we have many domains assigned? I guess for big workloads it could be 10k domains or more (if the lowest level is the individual node). WDYT?

@tenzen-y (Member), Oct 16, 2024

I guess that levelValues does not significantly impact the gRPC message size, since the longer field name increases it by just ~0.05 MiB in the 10k-domains situation.

Additionally, I guess that TAS will be used for Top-of-Rack (ToR), Leaf, or Spine switches rather than Nodes, since the topology often needs to be considered based on the network topology, right?

Let me know if you have actual or imaginable situations where we need to construct the topology based on Nodes.

Contributor Author

> Additionally, I guess that TAS will be used for Top-of-Rack (ToR), Leaf, or Spine switches rather than Nodes, since the topology often needs to be considered based on the network topology, right?

Later maybe yes; for now we just want to ensure the pods are landing on closely connected nodes.

I think there could be use cases to assign at the level of nodes to ensure there is no issue with fragmentation of quota. For example, if you assign at the level of a rack, it could happen that a Pod can fit in the rack, but cannot fit on any individual node. I guess it will need to depend on the user preference.

Contributor Author

> I guess that levelValues does not significantly impact the gRPC message size, since the longer field name increases it by just ~0.05 MiB in the 10k-domains situation.

Yes, it is not much, but the overall max API object size in etcd is 1.5 MiB, which is not much either, so the gain is around 3% of the max size.
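
A back-of-envelope check of these figures (assuming the only extra cost of the longer name is the renamed JSON key, serialized once per domain):

```go
// Rough arithmetic behind the numbers quoted above.
const (
	extraBytesPerDomain = len("levelValues") - len("values") // 5 extra bytes per domain
	domains             = 10_000
	etcdMaxObjectBytes  = 1_500_000 // ~1.5 MiB etcd object size limit
)

// extraBytes ≈ 50 kB ≈ 0.05 MiB; shareOfMax ≈ 3% of the etcd limit.
var (
	extraBytes = extraBytesPerDomain * domains
	shareOfMax = 100 * float64(extraBytes) / etcdMaxObjectBytes
)
```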

Member

> For example, if you assign at the level of a rack, it could happen that a Pod can fit in the rack, but cannot fit on any individual node. I guess it will need to depend on the user preference.

I feel that this usage goes beyond the initial TAS motivations. But I know that we received similar requests here: #3211

> Yes, it is not much, but the overall max API object size in etcd is 1.5 MiB, which is not much either, so the gain is around 3% of the max size.

Yes, that's true. The key problem here is which disadvantage we should accept. I think both options have the disadvantages below:

  • values: it lacks the context and background that this field holds the values for the nodeSelector keys defined in the levels; only after reading the API documentation can we understand the purpose of this field.
  • levelValues: it increases the Workload object size, which may cause etcd performance issues.

Given that the Workload object is for Kueue developers and not exposed to the batch users, I am leaning toward accepting values. In that case, could you mention this values vs levelValues discussion in the Alternatives section? I guess that we can revisit this alternative based on user feedback, for example if batch admins often struggle because of this field name.

// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
Groups []TopologyAssignmentGroup `json:"groups"`
// +kubebuilder:validation:MaxItems=8
Levels []string `json:"levels"`

@tenzen-y (Member), Oct 16, 2024

Which objective does this field serve: "storing desired state" or "a temporary cache for the computation by the TopologyUngater"?
I guess that this is the "temporary cache for the computation by the TopologyUngater".

In that case, could we put this information in an internal cache instead of the CRD?

Contributor Author

It is the desired state to be reached by the Pods. We cannot use internal state, because then it would be lost on Kueue restart.

Member

Does that mean that the TopologyUngater cannot obtain the levels from the Topology CR when Kueue restarts?

Contributor Author

Oh, I see what you mean. A similar question was asked in this comment.

It makes the implementation of the TopologyUngater simpler (otherwise it would require lookups via the ResourceFlavor API to the Topology API, and dealing with changes to those objects), at a very small cost in terms of object size, as the number of levels is quite limited.

Also, I think it is conceptually consistent with what we do here, as we store the mapping between resources and flavors, rather than reading the resources from the ClusterQueue API object.

Member

> otherwise it would require lookups via the ResourceFlavor API to the Topology API, and dealing with changes to those objects

I think that this could be resolved by a dedicated internal cache to store the relationships between ResourceFlavor and Topology.

> at a very small cost in terms of object size, as the number of levels is quite limited

Yes, I agree with you. My doubt was that the field is just a cache, since we can obtain it from other objects.

> Also, I think it is conceptually consistent with what we do here, as we store the mapping between resources and flavors, rather than reading the resources from the ClusterQueue API object.

Uhm, I see. Indeed, the levels (new field) and flavors (existing field) are stored in the Workload status, and this seems to avoid the chain of API calls (ResourceFlavor -> Topology).

When we graduate the API version to v1beta2, we may want to restructure the Workload status field so that we can store the actually assigned flavor and topology information in the same field.

Anyway, I'm ok with adding the levels here. Thanks.

@mimowo (Contributor, author), Oct 16, 2024

> I think that this could be resolved by a dedicated internal cache to store the relationships between ResourceFlavor and Topology.

Sure, it looks possible, I get the point, but I think it would complicate the implementation to maintain the cache, rather than reading from the object.

> When we graduate the API version to v1beta2, we may want to restructure the Workload status field so that we can store the actually assigned flavor and topology information in the same field.

Sure, I'm ok to add it to the KEP to re-evaluate this decision.

> Anyway, I'm ok with adding the levels here. Thanks.

Thanks!

@mimowo (Contributor, author), Oct 16, 2024

> When we graduate the API version to v1beta2, we may want to restructure the Workload status field so that we can store the actually assigned flavor and topology information in the same field.

> Sure, I'm ok to add it to the KEP to re-evaluate this decision.

Let me know if you want that to be added, though I think it is unlikely to be dropped. For example, if we read the keys from a cache, they could change in the meantime, and then the domain assignment would be corrupted. If we know the keys at the moment of assignment, we can compare them with the ones in the cache (maintained based on the Topology API) and evict the workload.

Member

> Sure, it looks possible, I get the point, but I think it would complicate the implementation to maintain the cache, rather than reading from the object.

Yeah, that's true. I think this is a trade-off between implementation costs and API maintenance costs, since a dedicated cache could avoid exposing the API (and the risk of breaking its structure later), but it has implementation costs.

Anyway, I do not insist on a dedicated cache now.

> For example, if we read the keys from a cache, they could change in the meantime, and then the domain assignment would be corrupted.

This could be prevented by event-based cache updates, the same as in today's cache package (/pkg/cache).
I'm wondering if we could mention the dedicated cache mechanism, instead of the status field, in the Alternatives section.
But this is non-blocking for merging this PR.

Contributor Author

I can update the Alternatives section in a follow-up.

Member

Yeah, sure.

@k8s-ci-robot removed the lgtm label on Oct 16, 2024

@tenzen-y (Member) left a comment

Thank you for the update!
There are no blockers, and some comments can be addressed in follow-ups.
/lgtm
/approve

@k8s-ci-robot added the lgtm label on Oct 16, 2024

@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: db598e6a06d554912fabf11c8e6fba23db305b46

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y (Member)

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label on Oct 16, 2024
@k8s-ci-robot merged commit 13e6b2b into kubernetes-sigs:main on Oct 16, 2024
7 checks passed
@k8s-ci-robot added this to the v0.9 milestone on Oct 16, 2024
PBundyra pushed a commit to PBundyra/kueue that referenced this pull request Nov 5, 2024
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024